-
-
Notifications
You must be signed in to change notification settings - Fork 14.4k
Adding a new offload_args intrinsic, which only maps arguments #150683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
020f669 to
555131e
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
72feb1c to
b6f4295
Compare
This comment has been minimized.
This comment has been minimized.
020f669 to
0228337
Compare
|
r? @oli-obk |
|
|
|
|
||
| #[rustc_nounwind] | ||
| #[rustc_intrinsic] | ||
| pub const fn offload_args<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's a blocker for reviews.^^
@rustbot author
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Details ^^ Here you go.
@RalfJung I'm somewhat concerned whether people could write host code in the wrapper, which breaks if we map arguments to the GPU. I will make a follow-up PR, in which I will add support for forwarding host arguments that should not be mapped. In the case of the rocBLAS test, those would be alpha and beta. Once we have that, I think I could prohibit calling a Rust function, and enforce that f is a ForeignFn to prevent this risk, if you think it's needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know enough about this to have an opinion on that question.^^
But we can bikeshed the intrinsic as a whole a bit.
- Why do we need both
offloadandoffload_args? - Given that
offloadalso has arguments, the name doesn't seem very descriptive. Maybeoffload_with_host_helperor so? Just spitballing here as I have know idea how this all actually gets used. ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
offload is there to map all arguments to the GPU, and then launch a GPU kernel. It has a couple of extra args, to tell offload (ol) how many threads and blocks we want when launching it. That gpu kernel comes from a Rust function, which we compiled to the GPU.
offload_args also maps all the arguments to the GPU, and then launches a CPU function, which accepts pointers to GPU memory. That's relying on the called CPU function independently launching a GPU kernel. A lot of highly optimized libraries like cuBLAS, rocBLAS, cuDNN, etc. already do that. If you try to launch them with offload you will get a failure, because they are not meant to be launched from the GPU. If you launch them but pass CPU pointers, you will also get a failure, because all arrays are supposed to be in GPU memory. In the past NVIDIA used to partly support calling those functions from the GPU, but they dropped support. I could probably try to unify both offload intrinsics under one name, but with autodiff we also decided to split it into autodiff_forward and autodiff_reverse since they were behaving too differently, which otherwise made it confusing.
Under the hood, both create an offload_begin_mapper which copies data to the GPU, and an offload_end_mapper, which copies them back. The main intrinsic then also has a tgt_target_kernel_launch, which calls the GPU kernel. The offload_args intrinsic just calls the host function, but as you can see in the codegen test, we swap out the pointers before. This gives us the benefit of being able to optimize them both the same, so now both can use all the LLVM profiling/debugging/visualization tools (not yet tested, but should work) and both benefit from the LLVM opts that we started writing for them.
Happy to change the intrinsic name since we expect to have a pretty macro on top of it. My motivation for offload_args was, that this intrinsic only offloads arguments, whereas the full offload intrinsic offloads both the arguments and a function to the GPU.
Does that help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds a bit like offload_args is a lower-level primitive that could be used to implement offload... but it also seems like that's not actually how it works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They share a lot of the implementation, that's why this PR is so small, especially once the first 3 refactor commits from a different PR are merged. If you look at the commits, you'll see that I've added a boolean host flag. If it is set to true, we directly call the host function given to the intrinsic. The compilation pipeline in this case is also shorten, and a few of our optional args are set to None, since we don't need to generate all that much info for offload.
If the host boolean is set to false, we generate a __tgt_target_kernel (I love their naming style) and launch the function given to the intrinsic through that. This requires that we have previously compiled the function to the GPU, and embedded it's bitcode into our IR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so... I take it that it does not make sense to implement offset on top of offset_args in "userland" (rather than in codegen).
Happy to change the intrinsic name since we expect to have a pretty macro on top of it. My motivation for offload_args was, that this intrinsic only offloads arguments, whereas the full offload intrinsic offloads both the arguments and a function to the GPU.
I see. Happy to take input some more folks on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not that familiar with offload, so I’ll need help on the details, but this is how I understood it:
There are two (and a half?) primitive operations involved:
- Converting a CPU pointer to a GPU pointer (and the other half: converting it back)
- Launching a kernel on a GPU with some given arguments
The current offload implements both of these. However, for using cuda/rocm libraries, you want to be able to only do 1 without also doing 2, right?
On a low level, 1 and 2 really are orthogonal operations, so it would make sense to split them into two separate things.
Though the details of actually implementing that are more complicated. offload is a rather high-level api that hides things going on underneath, including cpu vs gpu memory/pointers.
The interface of doing the copy to and back around a closure seems nice, it looks like scoped threads and similar patterns.
Does it anything else than copy memory? Then we could also call it offload_copy_mem_args, offload_mem or offload_ptrs (or maybe something with ‘scoped’?).
I think we can the expose the two primitive operations as intrinsics:
/// Transfer memory referenced by args back and forth
pub fn offload_copy_mem_args<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R;
/// Call a kernel.
/// Does **not** copy memory (note: also no restriction on T)
pub fn offload_call_kernel<F>(args: T) -> R;This composes nicely to implement offload in code:
pub fn offload<F, T: crate::marker::Tuple, R>(f: F, args: T) -> R {
offload_copy_mem_args(|offloaded_args| {
offload_call_kernel(f, offloaded_args)
})
}And it gets rid of the bool argument to trigger one or the other implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replying here since it seems you posted that in the wrong thread
Also for the vision, I want these two intrinsics to be the default, which should (thanks to our new llvm opts) almost always be fast enough. However, we'll also offer explicit host2device, device2host memtransfers for users. oli-obk had a nice idea for a thin wrapper type that represents that a type is now on the GPU. In that case we could prevent usages on the CPU, and enforce that people only pass data to the GPU kernel which is already on the GPU.
But I see this third explicit mode (well it's not really a fully orthogonal mode, more like an extension) a bit similar to unsafe, it should hopefully be rarely necessary.
This sounds like you are discussing the public, user-visible API. In this PR we are discussing the underlying language primitives (intrinsics) powering that API. Those don't have the same concerns. No user should ever directly call these intrinsics. The intrinsics should be designed for
- ease of specification
- ease of implementation in backends
0228337 to
e0fd81c
Compare
|
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers. |
e0fd81c to
1a22802
Compare
|
☔ The latest upstream changes (presumably #151395) made this pull request unmergeable. Please resolve the merge conflicts. |
| /// } | ||
| /// | ||
| /// #[cfg(not(target_os = "linux"))] | ||
| /// fn sgemv_wrapper(A: &[f32; 6], x: &[f32; 3], y: &mut [f64; 2]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A here is a reference to invalid memory, as the pointer behind A is only valid on the GPU but this code runs on the CPU (also the cfg(not(target_os should probably be cfg(target_os without the not).
It works out fine here because A is never accessed, but I think it violates Rust’s guarantees.
To be safe, offload_args needs to map the type as well. I.e.
offload_args::<_, (&T, &[T])>(|t: *const T, ts: (*const T, usize)| {}, (t, &[t]));
// etc. for other mapped typesThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are right, that's what I was mentioning above to Ralf: https://github.com/rust-lang/rust/pull/150683/changes#r2706310846 That means some safe usages can be currently unsound till the follow-up pr lands, but it's a half implemented rustc intrinsic (which end-users aren't supposed to use anyway), so I'd argue it's ok. The alternative is to make this PR bigger, but it clearly already takes a while to review, so I'd rather have that separate.
Generally, I am very willing to put in significant effort to preserve noalias for the GPU kernels, which your type mapping I think does not do. If you watch the slides from my US LLVM dev talk, they show how I intend to keep it even for mut args like &mut [f32].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for the vision, I want these two intrinsics to be the default, which should (thanks to our new llvm opts) almost always be fast enough. However, we'll also offer explicit host2device, device2host memtransfers for users. oli-obk had a nice idea for a thin wrapper type that represents that a type is now on the GPU. In that case we could prevent usages on the CPU, and enforce that people only pass data to the GPU kernel which is already on the GPU.
But I see this third explicit mode (well it's not really a fully orthogonal mode, more like an extension) a bit similar to unsafe, it should hopefully be rarely necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That link doesn't work here so I am not sure what you are referring to. But I definitely didn't realize that the example contains unsound code, or else I would have made a fuzz about it. ;)
some safe usages can be currently unsound
The problem with the example isn't about being safe or not, it's about having UB. Examples with UB without a big fat warning sign are not okay.
Is there a better example that avoids the UB? Like, if you made all these references into raw pointers, would that work?
Also, this affects not just A but also x and y, wouldn't it? They are all GPU pointers now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, gh being lovely again. I was referring to my comment here:
@RalfJung I'm somewhat concerned whether people could write host code in the wrapper, which breaks if we map arguments to the GPU.
Yes, A,x,y are GPU pointers. That was why I had asked you about only supporting extern ForeignFn once the follow-up pr lands. The easiest solution for now is indeed to move this to raw pointers, till we have either of the follow-up pr's landed which make it safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intrinsic is unsafe, so it's entirely fine if some uses of it cause UB. But the docs for the intrinsic should be clear about what exactly is or isn't UB, or if that is yet to be determined then they should be clear about that.
Also, the example shouldn't be UB. :)
This intrinsic helps with supporting the various AMD & NVIDIA libraries like rocBLAS or cuBLAS.
They provide functions which must be called from the host, but require a mixture of host and device pointers.
This offload_args intrinsic maps our host allocations to device allocations and transfers memory as required.
It reuses the whole infrastructure which we already have for the main offload intrinsic.
Unlike the main offload intrinsic, this also already fully works with std. I also got it to work with a single cargo invocation:
RUSTFLAGS="-L native=/opt/rocm-6.4.0/lib -l dylib=rocblas -l dylib=amdhip64 -l dylib=omp -l dylib=omptarget -Zoffload=Args -Zunstable-options" cargo +offload run -rI'll rebase and drop the first 3 commits once the cleanup PR lands.
I updated
compiler/rustc_monomorphize/src/collector/autodiff.rs, it now works without no_mangle, otherwise the function won't be codegen'ed. It also works without lto=fat if we only have main.rsIf we put a function in lib.rs and call it in main.rs, then it currently trips the verifier. Happy to fix it either here or in a follow-up PR:
cc @kevinsala @Sa4dUs